Introduction

This unit introduces the basics of collecting, analyzing, and visualizing data, as well as making data-based decisions.

Data Basics

We will discuss data basics. More specifically, we will touch on three main topics: observations, variables, and data matrices; types of variables; and relationships between variables.
Data are organized in what we call a data matrix, where
- each row represents an observation or a case.
- each column represents a variable.
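
As a rough illustration of the row and column layout described above, here is a minimal sketch in plain Python. The country labels and all values are made up for the example and do not come from the course data.

```python
# Hypothetical data matrix: each dict is one row (an observation),
# each key is one column (a variable). All values are invented.
data_matrix = [
    {"country": "A", "requests": 120, "complied": 0.75},
    {"country": "B", "requests": 45, "complied": 0.60},
    {"country": "C", "requests": 300, "complied": 0.90},
]

n_observations = len(data_matrix)        # number of rows
variables = list(data_matrix[0].keys())  # column names

# Reading down one column gives every observed value of one variable.
requests_column = [row["requests"] for row in data_matrix]

print(n_observations, variables, requests_column)
```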

Screenshot taken from Coursera 0:43

There are two types of variables, numerical and categorical.

Numerical (in other words, quantitative) variables take on numerical values. It is sensible to add, subtract, take averages, etc., with these values.
Categorical (or qualitative) variables take on a limited number of distinct categories. These categories can be identified with numbers; for example, it is customary to see the gender variable coded as zero for males and one for females. But it wouldn't be sensible to do arithmetic operations with these values. They're merely placeholders for the levels of the categorical variable.
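
The difference can be sketched in a few lines of Python. The heights and gender codes below are invented for illustration: averaging makes sense for the numerical variable, while the coded categorical variable is better summarized by counting its levels.

```python
from collections import Counter
from statistics import mean

# Hypothetical values, invented for this example.
heights_cm = [162.5, 175.0, 168.2, 181.3]  # numerical: averaging is meaningful
gender_code = [0, 1, 1, 0]                 # categorical, coded 0 = male, 1 = female

average_height = mean(heights_cm)

# For a coded categorical variable, count the levels instead of
# averaging the codes, since the codes are only placeholders.
labels = {0: "male", 1: "female"}
gender_counts = Counter(labels[code] for code in gender_code)

print(average_height)
print(dict(gender_counts))
```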

Screenshot taken from Coursera 1:16

Numerical variables

Numerical variables can further be categorized as continuous or discrete.

Continuous numerical variables are usually measured, such as height. These variables can take on any of an infinite number of values within a given range.
Discrete numerical variables are those that take on one of a specific set of numeric values, where we're able to count or enumerate all of the possibilities. One example of a discrete variable is the number of cars a household owns. In general, count data are an example of discrete variables.

Screenshot taken from Coursera 2:14

Categorical variables

Categorical variables that have ordered levels are called ordinal. Think about a survey question where you're asked how satisfied you are with the customer service you received and the options are very unsatisfied, unsatisfied, neutral, satisfied and very satisfied. These levels have an inherent ordering, hence the variable would be called ordinal.
If the levels of a categorical variable do not have an inherent ordering, the variable is simply called categorical. For example: are you a morning person or an afternoon person?
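
One way to make the ordering of an ordinal variable explicit is to map each level to its rank. The sketch below uses the satisfaction levels from the survey example; the list of responses is invented.

```python
# Levels listed from lowest to highest; the order carries the information.
LEVELS = ["very unsatisfied", "unsatisfied", "neutral", "satisfied", "very satisfied"]
rank = {level: i for i, level in enumerate(LEVELS)}

# Hypothetical survey responses.
responses = ["satisfied", "neutral", "very unsatisfied", "very satisfied", "neutral"]

# Sorting by rank respects the inherent ordering
# (plain alphabetical sorting would not).
ordered = sorted(responses, key=rank.__getitem__)

# Because the levels are ordered, a middle (median) response is well defined.
median_response = ordered[len(ordered) // 2]

print(ordered)
print(median_response)
```

For a categorical variable without ordered levels, such as morning person versus afternoon person, no such ranking exists, so only counts per level make sense.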

Screenshot taken from Coursera 2:48

Screenshot taken from Coursera 3:40

Screenshot taken from Coursera 4:15

Relationship between variables

Here we have a scatter plot of the number of user data requests by country against Google's compliance rate. On average, as the number of requests increases, so does the compliance rate. One country sticks out as a potential outlier, with a much higher number of user data requests than the others: the United States.
Associated (or dependent) variables: two variables that show some connection with one another. The association can be further described as positive or negative, and for these variables the association appears to be positive.
Independent variables: two variables that are not associated.
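
The direction of an association can be checked numerically. The sketch below computes the Pearson correlation coefficient, one common summary of linear association that the notes above do not introduce, on a small set of invented request and compliance values. The function name pearson_r is a choice made for this example.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values, not the data from the scatter plot.
requests = [10, 20, 30, 40, 50]
compliance = [0.50, 0.55, 0.65, 0.70, 0.80]

r = pearson_r(requests, compliance)
direction = "positive" if r > 0 else "negative"

print(round(r, 3), direction)
```

A value near zero would be consistent with no linear association, although it would not by itself establish that the variables are independent.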

Screenshot taken from Coursera 5:03

Observational studies and experiments

We will define observational studies and experiments and discuss correlation and causation.

Observational

In an observational study, researchers collect data in a way that does not directly interfere with how the data arise. In other words, they merely observe. Based on observational studies, we can only establish an association (in other words, a correlation) between the explanatory and the response variables.
If an observational study uses data from the past, it's called a retrospective study, whereas if data are collected throughout the study, it's called a prospective study.

Experiment

In an experiment, on the other hand, researchers randomly assign subjects to treatments and can therefore establish causal connections between the explanatory and response variables.
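
Random assignment can be sketched as shuffling the subjects and dealing them out to the groups in turn. The function name randomly_assign, the subject labels, and the two group names are hypothetical, chosen only for this illustration.

```python
import random

def randomly_assign(subjects, treatments, seed=None):
    """Shuffle subjects, then deal them out to treatments in turn."""
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    groups = {t: [] for t in treatments}
    for i, subject in enumerate(shuffled):
        groups[treatments[i % len(treatments)]].append(subject)
    return groups

# Hypothetical subjects and treatment names.
subjects = [f"subject_{i}" for i in range(1, 9)]
groups = randomly_assign(subjects, ["treatment", "control"], seed=42)

for name, members in groups.items():
    print(name, members)
```

Because chance alone decides who lands in which group, characteristics of the subjects should on average be balanced across groups, which is what allows a causal reading of any difference in the response.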

Screenshot taken from Coursera 0:43

Screenshot taken from Coursera 3:11

The title of the article says that breakfast cereal keeps girls slim, but there are actually three possible explanations here.

One, eating breakfast does indeed cause girls to be slimmer.
Two, being slim might cause girls to eat breakfast, so the relationship could be reversed.
Three, there may be a third variable that is responsible for both being slim and eating breakfast. For example, generally being health conscious might result in being slim as well as starting the day off with breakfast.

Screenshot taken from Coursera 3:52

Such extraneous variables that affect both the explanatory and the response variable and that make it seem like there's a relationship between them are called confounding variables.
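
A small simulation can show how a confounder creates an association between two variables that have no causal link. In the sketch below everything is invented: health consciousness drives both a breakfast score and a slimness score, and neither score influences the other. The variable names, the normal distributions, and the effect sizes are assumptions made for the example.

```python
import random
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

rng = random.Random(0)
n = 5000

# Confounder: affects both variables below.
health = [rng.gauss(0, 1) for _ in range(n)]
# Neither of these depends on the other, only on health plus noise.
breakfast = [h + rng.gauss(0, 1) for h in health]
slimness = [h + rng.gauss(0, 1) for h in health]

# The raw association looks substantial even though no causal link exists.
r_raw = pearson_r(breakfast, slimness)

# Subtracting the confounder's contribution (known exactly here, because
# we wrote the simulation) leaves only independent noise.
breakfast_resid = [b - h for b, h in zip(breakfast, health)]
slimness_resid = [s - h for s, h in zip(slimness, health)]
r_adjusted = pearson_r(breakfast_resid, slimness_resid)

print(round(r_raw, 2), round(r_adjusted, 2))
```

In real observational data the confounder's effect is not known in advance and may not even be measured, which is why such studies support only statements about association.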

Screenshot taken from Coursera 3:59

If you're going to walk away with one thing from this class, let it be this: correlation does not imply causation.
What determines whether we can infer causation or just correlation is the type of study on which we're basing our conclusions.
Observational studies, for the most part, allow us to make only correlational statements, while experiments allow us to infer causation.
We say "for the most part" because there are more advanced methods, broadly titled causal inference, that allow for making causal inferences from observational studies, but those methods are beyond the scope of this class.

Screenshot taken from Coursera 4:05

Screenshot taken from Coursera 4:39

Sampling and sources of bias

Screenshot taken from Coursera 4:19

Screenshot taken from Coursera 8:00

Experimental Design


